library(knitr)
library(scater)
library(scran)
library(cowplot)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE)

1 Batch effects

1.1 What are batch effects?

  • Common and powerful source of variation in high-throughput experiments
  • Hidden factors that can quantitatively alter the observed measurements across conditions unrelated to the biological or scientific variables in a study
  • Can remain even after “normalizing” the data. e.g. Figure 1 from Leek et al. (2010):

1.1.1 Surrogates for other sources of information

  • Most common to define batch of samples based when/how samples were processed
    • e.g. sample processing time/date, sequencing date, different protocols in the laboratory
  • These are only surrogates for other sources of variation not recorded
    • e.g. laboratory-specific effects, reaction temperatures, PCR reagents/conditions etc

1.2 Experimental design in scRNA-Seq

See Bacher and Kendziorski (2016) on ‘Design and computational analysis of single-cell RNA-sequencing experiments’ for detailed information on experimental design in scRNA-Seq.

  • Similar to bulk RNA-Seq, important to randomize (or balance) design across as many factors as possible to not introduce artifacts imposed from sample preparation or data collection.

  • In some cases, a fully randomized experiment is not possible and compromises are made in practice:
    • e.g. cost, time, capacity, limited samples, platforms to isolate and sequence cells
    • Most frequently, we cannot avoid batching cells, so replication is critical (samples must be processed in multiple batches) (more on this later).
  • Potential limiting factors in experimental design that must be balanced with experimental design
    • Protocols to isolate cells [see Saliba et al. (2014) and Kolodziejczyk et al. (2015)]
    • Protocols to extract mRNA from isolate cells [see Grun and van Oudenaarden (2015), Saliba et al. (2014) and Kolodziejczyk et al. (2015)]
    • Protocols for extracting RNA from isolate cell and conver to cDNA
      • Can vary with respect to transcript coverage and strand specificity
    • Choice of sequencing platform, time and budget constraints
    • Choice to use synthetic spike-ins (or not) or unique molecular identifiers UMIs [see Stegle et al. (2015)].
      • Spike-ins are added at a high concentration and can take up a large percentage of “read landscape” in sequencing.
      • UMIs can remove amplification bias, but are 5’ or 3’ end biased (bad for isoform or allele-specific expression).
  • Trade-off between number of cells and sequencing depth:
    • Not many papers on determining minimum number of cells to sequence for a given biological question
    • Most papers focus on defining minimum sequencing depths to observe majority of expressed genes [e.g. Shalek et al. (2014) and Tung et al. (2016)]

1.3 Batch effects in scRNA-Seq

Batch effects occur when cells from one biological group or condition are cultured, captured and sequenced separate from cells in a second condition.

1.3.1 The big problem with batch effects

  • When batch effects are correlated (or confounded) with an outcome of interest
  • In the figure in Section 1 with the bladder samples, only normal samples were considered.
  • Consider the case where all normal samples are processed in “Batch 1” and all cancer samples are processed in “Batch 2”.
    • Impossible to determine if variation is due to cancer/normal status or batch effects.

In a completely confounded study, it is not possible to determine if the biological condition or batch effects are driving the observed variation. In contrast, incorporating biological replicates across in the experimental design and processing the replicates across multiple batches permits observed variation to be attributed to biology or batch effects [Hicks et al. (2015)].

This idea was also proposed more concretely for the C1 platform in Tung et al. (2016). The idea was to take isolate and sequence cells from multiple individuals within a C1 plate and obtain multiple C1 replicates to observe batch effect across C1 plates.

1.3.2 Using FASTQ file as surrogate for “batch”

Although batch information is not always included in the experimental annotations that are publicly available, one can extract surrogate variables from the raw sequencing (FASTQ) files.

For example, this is a screenshot of a FASTQ file. Each read contains this header information.

This is how we define batches of cells, but keep in mind this is only a surrogate for other sources of information.

1.3.3 How to correct for batch effects?

First, let’s consider the question of where are we at in terms of methods for normalization and dealing with technical variation in scRNA-Seq data?

In my opinion, it’s a bit like being in the “Wild West” (very much an open-ended research question)

  • We’re still in the early stages of dealing with the technical noise that we see in scRNA-Seq data through the creation of new normalization and analysis methods, but there is still lots of room for improvement
    • e.g. what methods are considered “the best” since most of these methods have compared to methods developed for bulk RNA-Seq.
  • To add on top of that, the technologies that we’re using to isolate, sort and process the cells are being developed at a fairly fast pace. Just because new technologies are emerging doesn’t mean that technical variation is being eliminated. In fact, it’s usually just the opposite. When new technologies emerge, there is usually an entirely new set of issues that need to be dealt with.
    • e.g. the transition from microarrays to bulk RNA-Seq

OK so what about batch effects in scRNA-Seq data? There is some good news and some bad news…

  • Bad news: Batch effects can be a big problem in scRNA-Seq data (but not always)
  • Good news: Batch effects and methods to correct for batch effects have been around for many years (lots of places to start)
  • Bad news: Poor experimental design is a big limiting factor.
    • also, more complicated because of sparsity (biology and technology), capture efficiency, etc
  • Good news: Increased awareness about good experimental design. New methods specific for scRNA-Seq are being developed.

What have been the approaches to computationally correct for batch effects in bulk high-throughput experiments?

  1. Explicitly model batch effects in a statistical model e.g. fixed/random effects
    • This only works assuming you have a well-designed randomized experiment with replicates.
    • Even then, it’s not straight-forward how to do with scRNA-Seq data.
    • Additional noise observed at the single-cell level, confounded study designs, etc
  2. Account/correct/adjust data in “normalization” steps
    • But batch effects can often remain in data even after “normalization” (e.g. bladder samples and more on this later too)

See Bacher and Kendziorski (2016), Tung et al. (2016), an F1000 workflow for the analysis of scRNA-seq data and slides from Davis McCarthy presented at Genome Informatics 2016 for more information on computationally correcting for batch effects in scRNA-Seq data with good experimental designs.

2 Case-study

OK, let’s consider a case-study with some scRNA-Seq data with replicates.

2.1 Tung et al. (2016) human scRNA-Seq data

In this tutorial, we will use the human scRNA-Seq data from Tung et al. (2016), which is available in a GitHub repository.

Below, we will read in the files directly from the web, so no need to download the files. If you would like to have the files on your local computer, there are a few ways of getting the files. If you are familiar with git, you can download the repository by using git commands:

git clone https://github.com/jdblischak/singleCellSeq.git

Alternatively, you can also use the the command wget (or just manually download the files):

wget https://github.com/jdblischak/singleCellSeq/raw/master/data/annotation.txt -P data
wget https://github.com/jdblischak/singleCellSeq/raw/master/data/molecules-filter.txt -P data
wget https://github.com/jdblischak/singleCellSeq/raw/master/data/quality-single-cells.txt -P data

In the Tung et al. paper, the authors investigated the technical variation associated with sample processing using the single cell Fluidigm C1 platform by processing cells from three C1 replicates from three human induced pluripotent stem cell (iPSC) lines. Unique molecular identifiers (UMIs) were added to samples to account for amplifcation bias.

“We found that the major source of variation in the gene expression data was driven by genotype, but we also observed substantial variation between the technical replicates.”

2.2 scater and scran R/Bioconductor packages

For this tutorial we will use R/Bioconductor packages to investigate batch effects in scRNA-Seq data. The main packages we will use are the scater and scran. Much of the code in this tutorial was adapted from the scater tutorials available on GitHub.

Main feature of the scran package that we’ll discuss:

  • cell-specific scaling normalization factors (similar to DESeq2 or edgeR scaling factors for bulk RNA-Seq)
  • the idea here is to calculate these cell-specific normalization factors by pooling across cells with many zero counts [see Lun et al. (2016) for more information]

Main features of the scater package that we’ll discuss:

  • new SCESet (single-cell ExpressionSet) class in Bioconductor
  • compatible with counts or other transformed expression values (e.g. TPM, FPKM, CPM, etc)
  • easy to calculate QC metrics (feature-level and cell-level),
  • filter features and cells based on QC metrics
  • exploratory data analysis/data visualization
  • methods to investigate and remove uninteresting covariates (e.g. batch effects)
  • See McCarthy et al. (2016) for more information

For more information on other features in scater, see below:

2.3 Load data in R

We will download the Tung et al. data directly from the GitHub reposistory and create an SCESet object. Code in this subsection is adapted from the code provided by the authors.

First load the data for annotation data for cells:

web_url <- "https://github.com/jdblischak/singleCellSeq/raw/master/data"
anno <- read.table(file.path(web_url, "annotation.txt"), 
                   header = TRUE, stringsAsFactors = FALSE)
anno <- as.data.frame(anno)
anno <- anno[, !(colnames(anno) %in% "well")]
rownames(anno) <- anno$sample_id

head(anno[1:20,])
##                individual replicate      batch      sample_id
## NA19098.r1.A01    NA19098        r1 NA19098.r1 NA19098.r1.A01
## NA19098.r1.A02    NA19098        r1 NA19098.r1 NA19098.r1.A02
## NA19098.r1.A03    NA19098        r1 NA19098.r1 NA19098.r1.A03
## NA19098.r1.A04    NA19098        r1 NA19098.r1 NA19098.r1.A04
## NA19098.r1.A05    NA19098        r1 NA19098.r1 NA19098.r1.A05
## NA19098.r1.A06    NA19098        r1 NA19098.r1 NA19098.r1.A06

We can count the number of cells sequenced from each individual and across the three technical replicates.

table(anno$individual, anno$replicate)
##          
##           r1 r2 r3
##   NA19098 96 96 96
##   NA19101 96 96 96
##   NA19239 96 96 96

The authors provide a table for which cells were considered low vs high “quality” cells. To read more about the criteria used to define the filter, see here. So we will read in this data and add TRUE/FALSE logical column representing which cells were considered low quality.

quality_single_cells <- read.table(file.path(web_url, "quality-single-cells.txt"), header = FALSE)
anno$low_quality_cells <- !(rownames(anno) %in% quality_single_cells$V1)
table(anno$low_quality_cells)
## 
## FALSE  TRUE 
##   564   300

Next, we read in the count data for endogenous mRNA and ERCC genes

counts <- read.table(file.path(web_url, "molecules-raw-single-per-sample.txt"), header = FALSE)
counts <- as.data.frame(t(counts), stringsAsFactors = FALSE)
sample.id <- paste(counts[1,-1], counts[2,-1], counts[3,-1], sep=".")
counts <- counts[-c(1:3),] # remove top three rows
geneName <- counts$V1 # save gene names
counts <- counts[,-1] # remove column with gene names
counts <- apply(counts, 2, as.numeric) # transform from characters to numerics
counts <- as.data.frame(counts) # convert to a data frame
rownames(counts) <- geneName # add gene names a row names
colnames(counts) <- sample.id # add sample names as column headers
head(counts[,1:5]) # show first 6 rows and first 5 cells
##                 NA19098.r1.A01 NA19098.r1.A02 NA19098.r1.A03
## ENSG00000186092              0              0              0
## ENSG00000237683              0              0              0
## ENSG00000235249              0              0              0
## ENSG00000185097              0              0              0
## ENSG00000269831              0              0              0
## ENSG00000269308              0              0              0
##                 NA19098.r1.A04 NA19098.r1.A05
## ENSG00000186092              0              0
## ENSG00000237683              1              0
## ENSG00000235249              0              0
## ENSG00000185097              0              0
## ENSG00000269831              0              0
## ENSG00000269308              0              0

2.3.1 Create SCESet object

pd <- new("AnnotatedDataFrame", anno)
fd <- data.frame(gene = rownames(counts))
rownames(fd) <- rownames(counts)
fd <- new("AnnotatedDataFrame", fd)
sce_tung_raw <- newSCESet(countData = counts, phenoData = pd, featureData = fd)

For purposes of this tutorial, we will remove the biological variation and only focus on the technical variation in the cells by picking one individual (NA19101) to explore batch effects and technical biases in this data.

sce_tung_raw <- sce_tung_raw[, pData(sce_tung_raw)$individual == "NA19101"]
sce_tung_raw
## SCESet (storageMode: environment)
## assayData: 20419 features, 288 samples 
##   element names: counts, cpm, exprs, is_exprs 
## protocolData: none
## phenoData
##   sampleNames: NA19101.r1.A01 NA19101.r1.A02 ... NA19101.r3.H12
##     (288 total)
##   varLabels: individual replicate ... low_quality_cells (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSG00000186092 ENSG00000237683 ... ERCC-00171
##     (20419 total)
##   fvarLabels: gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

There are 20419 genes and 288 cells in this data set.

dim(sce_tung_raw)
## Features  Samples 
##    20419      288

Recommended by scater package: “since the data were produced using UMIs, cpm (counts-per-million) should be the correct units for analysis (transcript length becomes irrelevant as we are counting molecules directly).”"

2.3.2 QC metrics

Next, we calculate QC metrics with calculateQCMetrics() function in scater. We can define which genes are the control genes (“feature controls”). For example, here we define the ERCC spike-ins and mitochondrial genes as feature controls.

is.ercc <- grepl("ERCC", featureNames(sce_tung_raw))

## create a list of mitochondrial genes (13 protein-coding genes)
##    provided by authors
## MT-ATP6, MT-CYB, MT-ND1, MT-ND4, MT-ND4L, MT-ND5, MT-ND6, MT-CO2, 
## MT-CO1, MT-ND2, MT-ATP8, MT-CO3, MT-ND3
mtgene <- c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
            "ENSG00000198886", "ENSG00000212907", "ENSG00000198786", 
            "ENSG00000198695", "ENSG00000198712", "ENSG00000198804", 
            "ENSG00000198763","ENSG00000228253", "ENSG00000198938", 
            "ENSG00000198840")
is.mito <- featureNames(sce_tung_raw) %in% mtgene
featureNames(sce_tung_raw)[match(mtgene, featureNames(sce_tung_raw))] <- paste0("mt-", mtgene)

sce_tung_raw <- calculateQCMetrics(sce_tung_raw, 
                    feature_controls = list(ERCC = is.ercc, mt = is.mito))
fData(sce_tung_raw)$endog_genes <- !fData(sce_tung_raw)$is_feature_control

You can explore what QC metrics are now available as summaries for each cell using colnames(pData(sce_tung_raw)) and for each gene using colnames(fData(sce_tung_raw)).

colnames(pData(sce_tung_raw))
colnames(fData(sce_tung_raw))

2.3.2.1 Filter genes

We can set minimum QC thresholds for a gene to be expressed using the is_exprs() function. The idea is to exclude genes with an overall, sparse expression level.

fData(sce_tung_raw)$use <- rowSums(is_exprs(sce_tung_raw)) >= 25
table(fData(sce_tung_raw)$use)
## 
## FALSE  TRUE 
##  7727 12692

Here we keep 12692 genes.

Another nice feature is the plotQC() function which provides several useful QC plots including one that plots the number of reads consumed by the top 50 most expressed genes. These are typically spike-ins, mitochondrial genes (or housekeeping genes). We can see that many reads are being are taken up by uninteresting biology.

plotQC(sce_tung_raw[fData(sce_tung_raw)$use, ], 
        type = "highest-expression", col_by_variable = "replicate" )

2.3.2.2 Filter cells

scater provides a suite of tools to flag or identify poor quality cells. The workhorse for this is based on the automated QC metrics generated using the calculateQCMetrics() function. You can also filter based on “any additional sequencing metrics from sequencing aligner/mapping software, and additional cell phenotypes such as from imaging”.

For example, in this data set the authors provided a list of low vs high “quality cells” that they determined and used in their analysis. We can explore some summary statistics of these cells. Other metrics that are useful are % reads mapped to ERCC spike-ins, total counts and total features.

For this, we will use the plotPhenoData() function to explore specific sample meta-data values. For example, in the plots below we can see how different QC metrics distinguish the cells.

p1 <- plotPhenoData(sce_tung_raw, aes_string(x = "total_counts", 
                                 y = "pct_counts_feature_controls", 
                                 colour = "low_quality_cells"))
p2 <- plotPhenoData(sce_tung_raw, aes_string(x = "total_features",
                                  y = "pct_counts_feature_controls",
                                  colour = "low_quality_cells"))
plot_grid(p1, p2, ncol = 2)

Use QC metrics to select a subset of cells for use.

sce_tung_raw$use <- (!sce_tung_raw$low_quality_cells)
table(sce_tung_raw$use)
## 
## FALSE  TRUE 
##    87   201

This would lead us to drop 87 cells from this dataset.

Another useful plot to visualize cells is to use the plot() function. This function allows you to visualise all cells as a cumulative proportion of reads per cell. You can see from the plot below that the cells not used in the analysis have more reads mapping to fewer features (or large proportion of their library accounted for by a handful of very highly-expressed genes) compared to cells included in the analysis.

plot(sce_tung_raw, block1 = "replicate", colour_by = "low_quality_cells", 
     exprs_values = "counts")

2.4 Dimension reduction

There are lots of functions in scater to visualize the high-dimensional data into a low dimension space (e.g. 2D). For this tutorial, we will just focus on two: PCA and t-SNE.

2.4.1 Principal Components Analysis (PCA)

First, we use the plotPCA() function to show how the cells from the one individual cluster along the first two principal components (colored by replicate, size of points correspond to the number of features or number of detected genes).

## only use endogenous genes 
plotPCA(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use, 
                     sce_tung_raw$use], 
        colour_by = "replicate", size_by = "total_features")

We see that the first PC seems to separate the cells by which plate was used (replicate) and the second PC seems to correspond to the total number of features.

2.4.2 t-SNE

We can also look at a t-SNE plot. In scater the perplexity parameter defaults to floor(ncol(sce_tung_filtered)/5) but it can be altered. Here we consider four different perplexity parameters:

p1 <- plotTSNE(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use,
                            sce_tung_raw$use], 
               colour_by = "replicate", size_by = "total_features",
               rand_seed = 20161130, perplexity = 2) + 
      labs(title = "Perplexity = 2")
p2 <- plotTSNE(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use,
                            sce_tung_raw$use], 
               colour_by = "replicate", size_by = "total_features",
               rand_seed = 20161130, perplexity = 5) + 
      labs(title = "Perplexity = 5")
p3 <- plotTSNE(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use,
                            sce_tung_raw$use], 
               colour_by = "replicate", size_by = "total_features",
               rand_seed = 20161130, perplexity = 25) + 
      labs(title = "Perplexity = 25")
p4 <- plotTSNE(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use,
                            sce_tung_raw$use], 
               colour_by = "replicate", size_by = "total_features",
               rand_seed = 20161130) + labs(title = "Perplexity (default)")
plot_grid(p1, p2, p3, p4, ncol = 2)

2.4.2.1 Helpful resource for understanding t-SNE plots

If you would like more information on understanding what is t-SNE?, what it can show? and what it cannot show?, I would recommend reading through these illustrative examples.

A few highlights:

The goal is to take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane. The algorithm is non-linear and adapts to the underlying data, performing different transformations on different regions. Those differences can be a major source of confusion.

The perplexity paramter is a tuneable parameter, “which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has.

Cluster sizes in a t-SNE plot mean nothing”. A main feature of t-SNE is to equalize the density of points. “As a result, it naturally expands dense clusters, and contracts sparse ones, evening out cluster sizes.

Distances between clusters might not mean anything”. This is highly dependent on the perplexity parameter.

2.5 Investigate explanatory variables

After filtering cells and genes, next we want to explore how technical noise relates to the variablity we see in the data. This can help us figure out how to normalize the data before downstream analyses. We can explore the effects of variables recorded during the experiment (e.g. replicates) and the QC metrics we computed.

Based on the PCA & t-SNE plots, we might guess that replicate and total_features (number of detected genes) might be two good explantory variables to start. In this example, it is unlikely that the total number of detected genes is related to biological variation (cells should be same size and only considering one individual). For example, the “number of genes detected often has a strong technical component due to variably recovered RNA, reverse transcription, or library amplification. Its effect can also be notably non-linear,affecting low expressed and high expressed genes differently.” (scater tutorials)

This is not always the case. Sometimes the number of detected genes is related to biology (e.g. different cell types, cell cycle) and you must proceed with caution in normalization.

OK. So we can use the plotQC() function “to explore the marginal % variance explained (per gene) of the various technical factors”.

plotQC(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use, 
                    sce_tung_raw$use], 
       type = "find-pcs", variable = "replicate")

plotQC(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use, sce_tung_raw$use], 
       type = "find-pcs", variable = "total_features")

We see the first PC is strong related to the replicate and the second PC is strongly related to the total number of genes detected in a cell. We can create these plots for each variable in the pData(sce_tung_raw) SCESet object one-by-one to explore the relative importance.

To speed things up, you can use type = "expl" argument to plotQC() “computes the marginal R-squared for each variable in the when fitting a linear model regressing expression values for each gene against just that variable, and displays a density plot of the gene-wise marginal R-squared values for the variables. The default approach looks at all variables in the slot of the object and plots the top nvars_to_plot variables (default is 10). The density curves for marginal R-squared show the relative importance of different variables for explaining variance in expression between cells.”

plotQC(sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use, sce_tung_raw$use], 
       type = "expl")

We see replicate and total_features seem to be the most important explanatory variables. In the next section, we will see how to condition them out in a normalization step.

Alternatively, they could be included in a statistical model downstream.

2.6 Normalization

In this section, we use the normaliseExprs() function to remove the variation from the explanatory variables we found in the previous section. The idea is to create a design matrix with the explantory variables and provide that to the normaliseExprs() function. Normalized expression values are returned for each gene (residuals from a linear model fitted with the design matrix) after size-factor normalizaion.

2.6.1 Size-factor normalization using scran

There are a couple of ways to calculate size-factors (many developed for bulk RNA-Seq methods e.g. TMM in edgeR, RLE in DESeq). Recently there was a method developed to calculate size factors specifically for scRNA-Seq [see Lun et al (2016)], which has been implemented in the scran R/Bioconductor package. This what we will use here.

sce_tung_qc <- sce_tung_raw[fData(sce_tung_raw)$use | fData(sce_tung_raw)$is_feature_control, 
                            sce_tung_raw$use]
endog_genes <- !fData(sce_tung_qc)$is_feature_control

## size factor normalisation with scran 
qclust <- quickCluster(sce_tung_qc)
sce_tung_qc <- computeSumFactors(sce_tung_qc, clusters = qclust)
hist(sce_tung_qc$size_factor, 
     main = "cell-specific size factors from scran", 
     xlab = "size factor")

sce_tung_qc <- normalise(sce_tung_qc)

We can plot the cells along the first two PCs using only endogenous genes to see the impact of the cell-specific normalization factors from scran.

## only endogenous genes are used
plt_pca_prenorm <- plotPCA(
    sce_tung_raw[fData(sce_tung_raw)$endog_genes & fData(sce_tung_raw)$use, sce_tung_raw$use],
    exprs_values = "exprs", colour_by = "replicate", size_by = "total_features") +
    ggtitle("PCA - no normalisation") +
    theme(legend.position = "bottom")
plt_pca_endog_norm <- plotPCA(
    sce_tung_qc[fData(sce_tung_qc)$endog_genes, ],
    exprs_values = "exprs", colour_by = "replicate", size_by = "total_features") +
    ggtitle("PCA - endogenous size-factor normalisation") +
    theme(legend.position = "bottom")
plot_grid(plt_pca_prenorm, plt_pca_endog_norm, ncol = 2)

2.6.2 Regressing out explanatory variables

The last step is to use the normaliseExprs() function to remove the variation from the explanatory variables after size-factor normalization.

As a reminder, the idea is to create a design matrix with the explantory variables and provide that to the normaliseExprs() function. Normalized expression values are returned for each gene (residuals from a linear model fitted with the design matrix) after size-factor normalizaion.

The explantory variables we are interested in regressing out are replicate and total_features (as this is assumed to be a technical variable in this case).

If you would like to know more details on this part, I extracted portions of text from from the scater tutorial:

“The residuals of a linear regression are returned to norm_exprs(object). Here, the exprs values in the SCESet are assumed to be on a log scale - if they are not, then this regression approach may not work reliably.”

“To regress out a set of explanatory variables, we simply need to supply a”design matrix" to normaliseExprs() that contains the matrix of values for the variables we wish to regress out. This is easy to construct in R with the built-in model.matrix function. To retain all normalised expression values, here we save the normalised expression residuals to a new object (norm_exprs_resid) and then add these values to a slot with the same name in the SCESet object:"

design <- model.matrix(~ sce_tung_qc$replicate + sce_tung_qc$total_features)
sce_tung_qc <- normaliseExprs(sce_tung_qc, design = design, exprs_values = "exprs",
                                method = "none", feature_set = endog_genes,
                                return_norm_as_exprs = FALSE)
set_exprs(sce_tung_qc, "norm_exprs_resid") <- norm_exprs(sce_tung_qc)
plt_pca_endog_norm_resid <- plotPCA(
    sce_tung_qc[fData(sce_tung_qc)$endog_genes, ],
    exprs_values = "norm_exprs_resid",
    size_by = "total_features",  colour_by = "replicate") +
    ggtitle("PCA - endogenous size-factor normalisation residuals") +
    theme(legend.position = "bottom")
plot_grid(plt_pca_prenorm, plt_pca_endog_norm, plt_pca_endog_norm_resid, ncol = 2)